## Uncomment and run this cell to install pandas and numpy
#!pip install pandas numpyCleaning the weather dataset
In this notebook, we’ll be using numpy and pandas.
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
Numpy is the fundamental package for scientific computing with Python.
Let’s install the packages pandas and numpy.
# import the libraries
import pandas as pd
import numpy as np
from dataidea.datasets import loadDatasetLet’s check the versions of python, numpy and pandas we’ll be using for this notebook
# checking python version
print('Python Version: ',)
!python --versionPython Version:
Python 3.10.12
# Checking numpy and pandas versions
print('Pandas Version: ', pd.__version__)
print('Numpy Version: ', np.__version__)Pandas Version: 2.2.1
Numpy Version: 1.26.4
Let’s load the dataset. We’ll be using a weather dataset that imagined for learning purposes.
# load the dataset
weather_data = loadDataset('weather')We can sample out random rows from the dataset using the sample() method, we can use the n parameter to specify the number of rows to sample
# sample out random values from the dataset
weather_data.sample(n=5)| day | temperature | windspead | event | |
|---|---|---|---|---|
| 2 | 05/01/2017 | 28.0 | NaN | Snow |
| 8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
| 7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
| 1 | 04/01/2017 | NaN | 9.0 | Sunny |
| 4 | 07/01/2017 | 32.0 | NaN | Rain |
Display some info about the dataset eg number of entries, count of non-null values and variable datatypes using the info() method
# get quick dataframe info
weather_data.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 day 9 non-null object
1 temperature 5 non-null float64
2 windspead 5 non-null float64
3 event 7 non-null object
dtypes: float64(2), object(2)
memory usage: 416.0+ bytes
We can count all missing values in each column in our dataframe by using dataframe.isna().sum(), eg
# count missing values in each column
weather_data.isna().sum()day 0
temperature 4
windspead 4
event 2
dtype: int64
We can use a boolean-indexing like technique to find all rows in a dataset with missing values in a specific column.
# get rows with missing data in temperature
weather_data[weather_data.temperature.isna()]| day | temperature | windspead | event | |
|---|---|---|---|---|
| 1 | 04/01/2017 | NaN | 9.0 | Sunny |
| 3 | 06/01/2017 | NaN | 7.0 | NaN |
| 5 | 08/01/2017 | NaN | NaN | Sunny |
| 6 | 09/01/2017 | NaN | NaN | NaN |
# get rows with missing data in event column
weather_data[weather_data.event.isna()]| day | temperature | windspead | event | |
|---|---|---|---|---|
| 3 | 06/01/2017 | NaN | 7.0 | NaN |
| 6 | 09/01/2017 | NaN | NaN | NaN |
For the next part, we would like to demonstrate forward fill (ffill()) and backward fill (bfill), we first create two copies of the dataframe to avoid modifying our original copy in memory. - ffill() fills the missing values with the previous valid value in the column - bfill() fills the missing values with the next valid value in the column
# Create copies of a dataframe
weather_data1 = weather_data.copy()
weather_data2 = weather_data.copy()# fill with the previous valid value
weather_data1['event'] = weather_data1.event.ffill()
weather_data1| day | temperature | windspead | event | |
|---|---|---|---|---|
| 0 | 01/01/2017 | 32.0 | 6.0 | Rain |
| 1 | 04/01/2017 | NaN | 9.0 | Sunny |
| 2 | 05/01/2017 | 28.0 | NaN | Snow |
| 3 | 06/01/2017 | NaN | 7.0 | Snow |
| 4 | 07/01/2017 | 32.0 | NaN | Rain |
| 5 | 08/01/2017 | NaN | NaN | Sunny |
| 6 | 09/01/2017 | NaN | NaN | Sunny |
| 7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
| 8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
weather_data| day | temperature | windspead | event | |
|---|---|---|---|---|
| 0 | 01/01/2017 | 32.0 | 6.0 | Rain |
| 1 | 04/01/2017 | NaN | 9.0 | Sunny |
| 2 | 05/01/2017 | 28.0 | NaN | Snow |
| 3 | 06/01/2017 | NaN | 7.0 | NaN |
| 4 | 07/01/2017 | 32.0 | NaN | Rain |
| 5 | 08/01/2017 | NaN | NaN | Sunny |
| 6 | 09/01/2017 | NaN | NaN | NaN |
| 7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
| 8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
# fill with the next valid value in the column
weather_data2['event'] = weather_data2.event.bfill()
weather_data2| day | temperature | windspead | event | |
|---|---|---|---|---|
| 0 | 01/01/2017 | 32.0 | 6.0 | Rain |
| 1 | 04/01/2017 | NaN | 9.0 | Sunny |
| 2 | 05/01/2017 | 28.0 | NaN | Snow |
| 3 | 06/01/2017 | NaN | 7.0 | Rain |
| 4 | 07/01/2017 | 32.0 | NaN | Rain |
| 5 | 08/01/2017 | NaN | NaN | Sunny |
| 6 | 09/01/2017 | NaN | NaN | Cloudy |
| 7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
| 8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
We can modify (or fill) a specific value in the dataframe by using the loc[] method. This picks the value by its row (index) and column names. Assigning it a new value modifies it in the dataframe as illustrated below
# modify a specific value in the dataframe
weather_data2.loc[1, 'temperature'] = 29
weather_data2| day | temperature | windspead | event | |
|---|---|---|---|---|
| 0 | 01/01/2017 | 32.0 | 6.0 | Rain |
| 1 | 04/01/2017 | 29.0 | 9.0 | Sunny |
| 2 | 05/01/2017 | 28.0 | NaN | Snow |
| 3 | 06/01/2017 | NaN | 7.0 | Rain |
| 4 | 07/01/2017 | 32.0 | NaN | Rain |
| 5 | 08/01/2017 | NaN | NaN | Sunny |
| 6 | 09/01/2017 | NaN | NaN | Cloudy |
| 7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
| 8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
We can use the fillna() method to replace all missing values in a column with a specific value as demostrated value
# replace missing values in temperature column with mean
weather_data2['temperature'] = weather_data2.temperature.fillna(
value=weather_data2.temperature.mean()
)
weather_data2| day | temperature | windspead | event | |
|---|---|---|---|---|
| 0 | 01/01/2017 | 32.0 | 6.0 | Rain |
| 1 | 04/01/2017 | 29.0 | 9.0 | Sunny |
| 2 | 05/01/2017 | 28.0 | NaN | Snow |
| 3 | 06/01/2017 | 32.5 | 7.0 | Rain |
| 4 | 07/01/2017 | 32.0 | NaN | Rain |
| 5 | 08/01/2017 | 32.5 | NaN | Sunny |
| 6 | 09/01/2017 | 32.5 | NaN | Cloudy |
| 7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
| 8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
# create a copy of weather_data2
weather_data22 = weather_data2.copy()# Replace missing values in windspead column with a specific value
weather_data22['windspead'] = weather_data2.windspead.fillna(value=7.5)
weather_data22| day | temperature | windspead | event | |
|---|---|---|---|---|
| 0 | 01/01/2017 | 32.0 | 6.0 | Rain |
| 1 | 04/01/2017 | 29.0 | 9.0 | Sunny |
| 2 | 05/01/2017 | 28.0 | 7.5 | Snow |
| 3 | 06/01/2017 | 32.5 | 7.0 | Rain |
| 4 | 07/01/2017 | 32.0 | 7.5 | Rain |
| 5 | 08/01/2017 | 32.5 | 7.5 | Sunny |
| 6 | 09/01/2017 | 32.5 | 7.5 | Cloudy |
| 7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
| 8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
We can also use the fillna() method to fill missing values in multiple columns by passing in the dictionary of key/value pairs of column-name and value to replace. Before we demonstrate this, let’s create a copy of the dataframe to avoid modifying the original in memory
# create a copy of the weather_data dataframe
weather_data3 = weather_data.copy()# Replace missing values in temperature, column and event
weather_data3.fillna(value={
'temperature': weather_data3.temperature.mean(),
'windspead': weather_data3.windspead.max(),
'event': weather_data3.event.bfill()
}, inplace=True)weather_data3| day | temperature | windspead | event | |
|---|---|---|---|---|
| 0 | 01/01/2017 | 32.0 | 6.0 | Rain |
| 1 | 04/01/2017 | 33.2 | 9.0 | Sunny |
| 2 | 05/01/2017 | 28.0 | 12.0 | Snow |
| 3 | 06/01/2017 | 33.2 | 7.0 | Rain |
| 4 | 07/01/2017 | 32.0 | 12.0 | Rain |
| 5 | 08/01/2017 | 33.2 | 12.0 | Sunny |
| 6 | 09/01/2017 | 33.2 | 12.0 | Cloudy |
| 7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
| 8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
We can optionally drop all rows with missing values using the dropna() method. Before we demonstrate that, let’s first create a dataframe copy to avoid modifying the original in the memory
# create a copy of weather_data dataframe
weather_data4 = weather_data.copy()# Drop all rows with missing values
weather_data4.dropna()| day | temperature | windspead | event | |
|---|---|---|---|---|
| 0 | 01/01/2017 | 32.0 | 6.0 | Rain |
| 7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
| 8 | 11/01/2017 | 40.0 | 12.0 | Sunny |